Reminder
System Info
The training configuration is as follows:
### model
model_name_or_path: /raid/zhanghang02/weights/MiniCPM-V-2_6
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true
### method
stage: dpo
do_train: true
finetuning_type: lora
freeze_vision_tower: true
lora_rank: 8
lora_target: all
pref_beta: 0.1
pref_loss: sigmoid # choices: [sigmoid (dpo), orpo, simpo]
### dataset
dataset: dpo_test_video
template: minicpm_v
cutoff_len: 256
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 1
dataloader_num_workers: 1
### output
output_dir: saves/minicpmv/lora/dpo
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none # choices: [none, wandb, tensorboard, swanlab, mlflow]
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 300.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
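For context, the "Total optimization steps = 32,700" figure reported in the log below follows directly from these settings. A minimal sketch of the arithmetic (assuming a single GPU, which is consistent with the total train batch size of 1 shown in the log):

```python
# Rough sanity check, not LLaMA-Factory code: how the trainer arrives at 32,700 steps.
import math

num_examples = 109          # "Num examples = 109" in the training log (after the 0.1 val split)
num_epochs = 300            # num_train_epochs
per_device_batch_size = 1   # per_device_train_batch_size
grad_accum = 1              # gradient_accumulation_steps
num_devices = 1             # assumed: single GPU

effective_batch = per_device_batch_size * grad_accum * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
print(steps_per_epoch * num_epochs)  # 32700
```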
The error output is as follows:
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.31s/it]
[INFO|modeling_utils.py:4888] 2025-05-26 16:46:08,068 >> All model checkpoint weights were used when initializing MiniCPMV.
[INFO|modeling_utils.py:4896] 2025-05-26 16:46:08,069 >> All the weights of MiniCPMV were initialized from the model checkpoint at /raid/zhanghang02/weights/MiniCPM-V-2_6.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MiniCPMV for predictions without further training.
[INFO|configuration_utils.py:1093] 2025-05-26 16:46:08,156 >> loading configuration file /raid/zhanghang02/weights/MiniCPM-V-2_6/generation_config.json
[INFO|configuration_utils.py:1140] 2025-05-26 16:46:08,156 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-05-26 16:46:08] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
[INFO|2025-05-26 16:46:08] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.misc:143 >> Found linear modules: q_proj,v_proj,up_proj,k_proj,o_proj,down_proj,gate_proj
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.visual:143 >> Set vision model not trainable: ['vpm'].
[INFO|2025-05-26 16:46:08] llamafactory.model.model_utils.visual:143 >> Set multi model projector not trainable: resampler.
[INFO|2025-05-26 16:46:08] llamafactory.model.loader:143 >> trainable params: 20,185,088 || all params: 8,119,360,240 || trainable%: 0.2486
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:741] 2025-05-26 16:46:09,007 >> Using auto half precision backend
[INFO|trainer.py:2369] 2025-05-26 16:46:09,246 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-05-26 16:46:09,246 >> Num examples = 109
[INFO|trainer.py:2371] 2025-05-26 16:46:09,246 >> Num Epochs = 300
[INFO|trainer.py:2372] 2025-05-26 16:46:09,246 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2375] 2025-05-26 16:46:09,246 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:2376] 2025-05-26 16:46:09,246 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2377] 2025-05-26 16:46:09,246 >> Total optimization steps = 32,700
[INFO|trainer.py:2378] 2025-05-26 16:46:09,250 >> Number of trainable parameters = 20,185,088
0%| | 0/32700 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
File "/home/zhanghang02/anaconda3/envs/test1/bin/llamafactory-cli", line 8, in
sys.exit(main())
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/cli.py", line 115, in main
COMMAND_MAP[command]()
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 110, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 78, in _training_function
run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 80, in run_dpo
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 2171, in train
return inner_training_loop(
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 2480, in _inner_training_loop
batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 133, in get_batch_samples
return Trainer.get_batch_samples(self, *args, **kwargs)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/transformers/trainer.py", line 5153, in get_batch_samples
batch_samples += [next(epoch_iterator)]
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/accelerate/data_loader.py", line 566, in iter
current_batch = next(dataloader_iter)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 733, in next
data = self._next_data()
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1515, in _next_data
return self._process_data(data, worker_id)
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1550, in _process_data
data.reraise()
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/_utils.py", line 750, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
File "/home/zhanghang02/anaconda3/envs/test1/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
return self.collate_fn(data)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/collator.py", line 264, in call
return super().call(concatenated_features)
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/collator.py", line 157, in call
mm_inputs = self.template.mm_plugin.get_mm_inputs(
File "/home/zhanghang02/factory/LLaMA-Factory/src/llamafactory/data/mm_plugin.py", line 1080, in get_mm_inputs
image_bounds = torch.hstack(
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 3 but got size 2 for tensor number 1 in the list.
0%| | 0/32700 [00:00<?, ?it/s]
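For reference, the final RuntimeError is the standard shape check in torch.hstack, which concatenates along dimension 1 and therefore requires the tensors to agree in every other dimension. A minimal standalone reproduction with illustrative shapes (not the actual image_bounds tensors from mm_plugin.py):

```python
# Minimal repro of the error message, independent of LLaMA-Factory.
import torch

a = torch.zeros(3, 2)  # first tensor: 3 rows
b = torch.zeros(2, 2)  # second tensor: only 2 rows
torch.hstack([a, b])
# RuntimeError: Sizes of tensors must match except in dimension 1.
# Expected size 3 but got size 2 for tensor number 1 in the list.
```

In the failing call, the second tensor in the list passed to torch.hstack inside get_mm_inputs has size 2 in dimension 0 where size 3 was expected, which is exactly what the error message reports.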
Reproduction
Others
No response